Using Envision's Automatic Hand Gesture Detection PyPI package (envisionhgdetector)¶


Wim Pouw (wim.pouw@donders.ru.nl)


Info¶

In this notebook we use an EnvisionBOX Python package called "envisionhgdetector", which contains functions to automatically annotate gestures. In another EnvisionBOX module, on training a gesture classifier, we showed an end-to-end pipeline for training a model on particular human behaviors (e.g., head nodding, clapping) and then running inference on new videos. This package builds on that work. We have trained a convolutional neural network to differentiate between no gesture (including self-adaptors) and gesture, using the SAGA dataset, the Zhubo dataset, and the TED M3D dataset. Because the training data covers several datasets and camera angles, and more than 9000 gestures, this gesture detector can be applied to somewhat more varied settings than a model trained on a single dataset.

Now, don't get too excited! The performance is not extraordinary, and the package still awaits proper testing and further updates with better-trained models (we are working on it...). It currently does not differentiate between types of gestures (as far as that is possible; we are working on that too...). But it is good enough for a quick pass over a set of videos to get the prominent gestures out. Once we have the gestures, we can do all kinds of other interesting things, e.g., generate gesture kinematic statistics or gesture networks. And now all automatically!

Package info¶

https://pypi.org/project/envisionhgdetector/

What does envisionhgdetector do¶

  • It tracks upper body, hand, and face landmarks (generating 29 features).
  • It makes an inference based on 25 frames of data: no gesture (the default, implicit label), a gesture (label: Gesture), or some kind of movement that is not a gesture (label: Move).
  • It outputs a labeled video, an ELAN file, a confidence timeseries, and a gesture segment list (with labels and start and end times).
  • UPCOMING: In the future it will add a set of analyses on the gestures it isolates.
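
To make the "25 frames of data" idea concrete, here is a minimal sketch of how a per-frame feature timeseries can be cut into overlapping 25-frame windows before it is passed to a classifier. The numbers 29 and 25 come from the list above; the random features, the function name make_windows, and the stride of one frame are assumptions for illustration, not the package's actual code.

```python
import numpy as np

N_FEATURES = 29  # features per frame (from the list above)
WINDOW = 25      # frames per inference (from the list above)


def make_windows(features, window=WINDOW):
    """Cut a (n_frames, n_features) array into overlapping windows.

    Hypothetical helper: uses a stride of one frame, which is an assumption.
    Returns an array of shape (n_frames - window + 1, window, n_features).
    """
    views = np.lib.stride_tricks.sliding_window_view(features, window, axis=0)
    # sliding_window_view puts the window axis last; move it before the features
    return views.transpose(0, 2, 1)


# Stand-in for tracked landmarks: 100 frames of random features
features = np.random.default_rng(0).normal(size=(100, N_FEATURES))
windows = make_windows(features)
print(windows.shape)  # (76, 25, 29)
```

Each window would then receive one prediction, which is why the confidence timeseries can be a bit shorter than the video itself under this assumption.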

Installation¶

It is best to install in a conda environment.

conda create -n envision python=3.9

conda activate envision

Then proceed:

pip install -r requirements.txt

Errors?¶

  1. Make sure you have the C++ redistributables installed for TensorFlow to work: https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170#latest-microsoft-visual-c-redistributable-version

Citation for this notebook¶

  • Pouw, W. (2024). EnvisionBOX modules for social signal processing (Version 1.0.0) [Computer software]. https://github.com/WimPouw/envisionBOX_modulesWP

Citation¶

If you use this package, please cite:

  • Pouw, W. (2024). envisionhgdetector: Hand Gesture Detection Using a Convolutional Neural Network (Version 0.0.2) [Computer software]. https://github.com/WimPouw/envisionhgdetector

Citations for the packages and datasets¶

Original Nodding Pigeon training code:

  • Yung, B. (2022). Nodding Pigeon (Version 0.6.0) [Computer software]. https://github.com/bhky/nodding-pigeon

Zhubo dataset (used for training):

  • Bao, Y., Weng, D., & Gao, N. (2024). Editable Co-Speech Gesture Synthesis Enhanced with Individual Representative Gestures. Electronics, 13(16), 3315.

SAGA dataset (used for training):

  • Lücking, A., Bergmann, K., Hahn, F., Kopp, S., & Rieser, H. (2010). The Bielefeld speech and gesture alignment corpus (SaGA). In LREC 2010 workshop: Multimodal corpora–advances in capturing, coding and analyzing multimodality.

TED M3D dataset (used for training):

  • Rohrer, P. (2022). A temporal and pragmatic analysis of gesture-speech association: A corpus-based approach using the novel MultiModal MultiDimensional (M3D) labeling system [Doctoral dissertation, Nantes Université; Universitat Pompeu Fabra (Barcelona, Spain)].

MediaPipe:

  • Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., ... & Grundmann, M. (2019). MediaPipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.

Let's get started¶

For this tutorial, I have two videos that I would like to segment for hand gestures. Both live in the folder './videos_to_label/'.

In [2]:
import os
import glob

videofoldertoday = './videos_to_label/'
outputfolder = './output/'
In [3]:
import glob
from IPython.display import Video

# List all videos in the folder
videos = glob.glob(videofoldertoday + '*.mp4')
# Display single video
Video(videos[0], embed=True, width=200)
In [4]:
Video(videos[1], embed=True, width=200)

From the PyPI package info we see that we can get started with the following:

from envisionhgdetector import GestureDetector

# Initialize detector
detector = GestureDetector(
    motion_threshold=0.8,    # Sensitivity to motion
    gesture_threshold=0.8,   # Confidence threshold for gestures
    min_gap_s=0.3,          # Minimum gap between gestures
    min_length_s=0.3        # Minimum gesture duration
)

# Process videos
results = detector.process_folder(
    input_folder="path/to/videos",
    output_folder="path/to/output"
)

Play around¶

The gesture annotations can be fine-tuned with the following settings:

  1. motion_threshold: the confidence level required to count something as movement
  2. gesture_threshold: if there is movement, the confidence level required for the gesture or move category
  3. min_gap_s: gestures separated by a gap shorter than this many seconds are merged into one
  4. min_length_s: the shortest gesture you want to consider (shorter ones are removed)
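
To get a feel for what settings 3 and 4 do, here is a minimal sketch that turns a per-frame confidence timeseries into segments by thresholding, merging short gaps, and dropping short segments. This illustrates the general idea only; the function segments_from_confidence and the toy data are hypothetical and this is not the package's internal code.

```python
def segments_from_confidence(confidence, fps, threshold=0.9,
                             min_gap_s=0.2, min_length_s=0.5):
    """Return (start_s, end_s) pairs for runs of frames at or above threshold.

    Hypothetical helper for illustration, not part of envisionhgdetector.
    """
    # 1. collect runs of consecutive frames at or above the threshold
    runs = []
    start = None
    for i, c in enumerate(confidence):
        if c >= threshold and start is None:
            start = i
        elif c < threshold and start is not None:
            runs.append((start / fps, i / fps))
            start = None
    if start is not None:
        runs.append((start / fps, len(confidence) / fps))

    # 2. merge runs that are separated by a gap shorter than min_gap_s
    merged = []
    for seg in runs:
        if merged and seg[0] - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # 3. drop segments that are shorter than min_length_s
    return [(s, e) for s, e in merged if e - s >= min_length_s]


# Toy confidence series at 10 frames per second
confidence = [0.95] * 4 + [0.1] * 1 + [0.95] * 4 + [0.1] * 10 + [0.95] * 2
segments = segments_from_confidence(confidence, fps=10)
print(segments)  # [(0.0, 0.9)]
```

In this toy example the two runs at the start are only 0.1 s apart, so they are merged into one segment, while the 0.2 s run at the end is dropped because it is shorter than min_length_s.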
In [4]:
from envisionhgdetector import GestureDetector
import os

# use absolute paths
videofoldertoday = os.path.abspath('./videos_to_label/')
outputfolder = os.path.abspath('./output/')

# create a detector object
detector = GestureDetector(
    motion_threshold=0.9,
    gesture_threshold=0.9,
    min_gap_s=0.2,
    min_length_s=0.5,
)

# just do the detection on the folder
detector.process_folder(
    input_folder=videofoldertoday,
    output_folder=outputfolder,
)
WARNING:tensorflow:From c:\Users\u668173\Anaconda3\envs\envision\lib\site-packages\keras\src\backend\tensorflow\core.py:222: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

Successfully loaded weights from c:\Users\u668173\Anaconda3\envs\envision\lib\site-packages\envisionhgdetector\model\SAGAplus_gesturenogesture_trained_binaryCNNmodel_weightsv1.h5

Processing videoplayback (2).mp4...
Generating labeled video...
Generating elan file...
Done processing videoplayback (2).mp4, go look in the output folder

Processing videoplayback (2)_2_1.mp4...
Generating labeled video...
Generating elan file...
Done processing videoplayback (2)_2_1.mp4, go look in the output folder
Out[4]:
{'videoplayback (2).mp4': {'stats': {'average_motion': 0.8553642712561695,
   'average_gesture': 0.966857220036815,
   'average_move': 0.03314277811700271},
  'output_path': 'c:\\Users\\u668173\\Desktop\\wimpouwenvisionboxwp\\envisionBOX_modulesWP\\UsingEnvisionHGdetector_package\\output\\videoplayback (2).mp4.eaf'},
 'videoplayback (2)_2_1.mp4': {'stats': {'average_motion': 0.7676855239374508,
   'average_gesture': 0.9752756533218406,
   'average_move': 0.024724344355299285},
  'output_path': 'c:\\Users\\u668173\\Desktop\\wimpouwenvisionboxwp\\envisionBOX_modulesWP\\UsingEnvisionHGdetector_package\\output\\videoplayback (2)_2_1.mp4.eaf'}}
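
The returned dictionary is a bit hard to read once you have more than a few videos. As a small sketch, assuming you assign the return value of process_folder to a variable (here called results, a name not used in the cell above), the per-video stats can be put in a table. The dictionary below is a stand-in with values rounded from the output above and the output paths left out.

```python
import pandas as pd

# Stand-in for the dictionary returned by detector.process_folder(...);
# values are rounded from the output above, output paths are omitted.
results = {
    "videoplayback (2).mp4": {
        "stats": {"average_motion": 0.855,
                  "average_gesture": 0.967,
                  "average_move": 0.033},
    },
    "videoplayback (2)_2_1.mp4": {
        "stats": {"average_motion": 0.768,
                  "average_gesture": 0.975,
                  "average_move": 0.025},
    },
}

# one row per video, one column per statistic
stats_table = pd.DataFrame.from_dict(
    {video: info["stats"] for video, info in results.items()},
    orient="index",
)
print(stats_table)
```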
In [5]:
import glob
import os
import pandas as pd

# list the output files
outputfiles = glob.glob(os.path.join(outputfolder, '*'))
for file in outputfiles:
    print(os.path.basename(file))

# load the segments of one of the videos
csvfilessegments = glob.glob(os.path.join(outputfolder, '*segments.csv'))
df = pd.read_csv(csvfilessegments[0])
df.head()
labeled_videoplayback (2).mp4
labeled_videoplayback (2)_2_1.mp4
videoplayback (2).mp4.eaf
videoplayback (2).mp4_predictions.csv
videoplayback (2).mp4_segments.csv
videoplayback (2)_2_1.mp4.eaf
videoplayback (2)_2_1.mp4_predictions.csv
videoplayback (2)_2_1.mp4_segments.csv
Out[5]:
   start_time  end_time  labelid    label  duration
0    0.000000  0.689655        1  Gesture  0.689655
1    1.413793  3.896552        2  Gesture  2.482759
2    5.000000  5.724138        3  Gesture  0.724138
3    6.931034  7.517241        4  Gesture  0.586207
4    7.551724  8.379310        5  Gesture  0.827586
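
With the segments in a dataframe, a few simple descriptives are only a couple of lines away. The sketch below rebuilds the five rows shown above by hand (so that it runs on its own) and computes the number of gestures, the total and mean gesture duration, and the gap to the previous gesture. The column gap_to_previous and the variable names are made up for this example.

```python
import pandas as pd

# The five segments shown above, typed in by hand so the example is self-contained
segments = pd.DataFrame({
    "start_time": [0.000000, 1.413793, 5.000000, 6.931034, 7.551724],
    "end_time": [0.689655, 3.896552, 5.724138, 7.517241, 8.379310],
    "label": ["Gesture"] * 5,
})

# duration of each segment and the pause since the previous segment ended
segments["duration"] = segments["end_time"] - segments["start_time"]
segments["gap_to_previous"] = segments["start_time"] - segments["end_time"].shift(1)

# count, total duration, and mean duration per label
summary = segments.groupby("label")["duration"].agg(["count", "sum", "mean"])
print(summary)
```

In practice you would replace the hand-typed dataframe with the df loaded from the segments file.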

Now assess the labeled video data¶

In [5]:
from moviepy import VideoFileClip
videoslabeled = glob.glob(outputfolder + '/*.mp4')

# make sure the temp folder exists before writing to it
os.makedirs("./temp", exist_ok=True)

# need to rerender
clip = VideoFileClip(videoslabeled[1])
clip.write_videofile("./temp/example_2_labeled.mp4")
Video("./temp/example_2_labeled.mp4", embed=True)
MoviePy - Building video ./temp/example_2_labeled.mp4.
MoviePy - Writing video ./temp/example_2_labeled.mp4

                                                      
MoviePy - Done !
MoviePy - video ready ./temp/example_2_labeled.mp4

In [6]:
# need to rerender
clip = VideoFileClip(videoslabeled[0])
clip.write_videofile("./temp/example_1_labeled.mp4")
Video("./temp/example_1_labeled.mp4", embed=True)
MoviePy - Building video ./temp/example_1_labeled.mp4.
MoviePy - Writing video ./temp/example_1_labeled.mp4

                                                      
MoviePy - Done !
MoviePy - video ready ./temp/example_1_labeled.mp4


Concluding remarks¶

It is important to test the accuracy of the classifier against hand-labeled data that was not used to train the model. You would then report a confusion matrix (e.g., false positive rate, hits, etc.) or the machine-human interrater reliability. In the future I would like to add such code to this module, as well as train a general model for detecting general gestures. If you have data suitable for this and would like to use it for this purpose, you can contact me (wim.pouw@donders.ru.nl). In general, it would be great to know whether this module is valuable for your behaviors, and to learn the boundary conditions of this pipeline.
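
Until such code is part of this module, here is a minimal sketch of what such a check could look like: given one human label and one machine label per frame, it counts the cells of the confusion matrix and computes the hit rate, the false positive rate, and Cohen's kappa. The label "NoGesture", the toy label sequences, and the function names are assumptions for this example; how you align the human annotations with the frames is up to you.

```python
def confusion_counts(human, machine, positive="Gesture"):
    """Count true/false positives and negatives, taking the human labels as truth."""
    tp = fp = fn = tn = 0
    for h, m in zip(human, machine):
        if h == positive and m == positive:
            tp += 1
        elif h != positive and m == positive:
            fp += 1
        elif h == positive and m != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa for two raters and two categories."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # agreement expected by chance, from each rater's marginal proportions
    p_expected = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy per-frame labels (G = Gesture, N = NoGesture), made up for illustration
G, N = "Gesture", "NoGesture"
human = [G, G, G, N, N, N, N, G]
machine = [G, G, N, N, N, G, N, G]

tp, fp, fn, tn = confusion_counts(human, machine)
hit_rate = tp / (tp + fn)
false_positive_rate = fp / (fp + tn)
kappa = cohens_kappa(tp, fp, fn, tn)
print(tp, fp, fn, tn)                        # 3 1 1 3
print(hit_rate, false_positive_rate, kappa)  # 0.75 0.25 0.5
```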